5.11 Post-Training Embedding Binarization for Fast Online Top-K Passage Matching
To lower the complexity of BERT, the recent state-of-the-art model ColBERT [113] employs a contextualized late interaction paradigm to independently learn fine-grained query-passage representations. It comprises (1) a query encoder $f_Q$, (2) a passage encoder $f_D$, and (3) a query-passage score predictor. Specifically, given a query $q$ and a passage $d$, $f_Q$ and $f_D$ encode them into bags of fixed-size embeddings $E_q$ and $E_d$ as follows:
$$E_q = \mathrm{Normalize}\big(\mathrm{CNN}\big(\mathrm{BERT}(\text{``[Q]}\,q_0 q_1 \cdots q_l\text{''})\big)\big),$$
$$E_d = \mathrm{Filter}\big(\mathrm{Normalize}\big(\mathrm{CNN}\big(\mathrm{BERT}(\text{``[D]}\,d_0 d_1 \cdots d_n\text{''})\big)\big)\big), \tag{5.44}$$
where $q$ and $d$ are tokenized into tokens $q_0 q_1 \cdots q_l$ and $d_0 d_1 \cdots d_n$ by the BERT-based WordPiece tokenizer, respectively.
respectively. [Q] and [D] indicate the sequence types.
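The score predictor is the remaining component: in ColBERT's late interaction, each query token embedding is matched against its most similar passage token embedding, and these maximum similarities are summed (the MaxSim operator). Below is a minimal PyTorch sketch of this scoring; the random bags merely stand in for actual $f_Q$/$f_D$ outputs, and all shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def maxsim_score(E_q: torch.Tensor, E_d: torch.Tensor) -> torch.Tensor:
    """Late-interaction relevance score between a query and a passage.

    E_q: (l, dim) bag of L2-normalized query token embeddings.
    E_d: (n, dim) bag of L2-normalized passage token embeddings.
    """
    sim = E_q @ E_d.T                   # (l, n) cosine similarities
    return sim.max(dim=1).values.sum()  # max over passage tokens, summed over query tokens

# Toy usage: random bags standing in for the outputs of f_Q and f_D.
torch.manual_seed(0)
E_q = F.normalize(torch.randn(32, 128), dim=-1)   # hypothetical query bag
E_d = F.normalize(torch.randn(180, 128), dim=-1)  # hypothetical passage bag
print(maxsim_score(E_q, E_d))
```

For online top-K matching, this score is computed between one query bag and each candidate passage bag, and the passages with the K largest scores are returned.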
Despite the advances of ColBERT over the vanilla BERT model, its massive computation and parameter burden still hinder deployment on edge devices. Recently, Chen et al. [40] proposed Bi-ColBERT, which binarizes the embeddings to relieve the computation burden. Bi-ColBERT involves (1) semantic diffusion to hedge the information loss caused by embedding binarization, and (2) an approximation of the unit impulse function [18] for more accurate gradient estimation.
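The motivation for the second ingredient is that $\mathrm{sign}(\cdot)$ has a zero derivative almost everywhere (its true derivative is the unit impulse $2\delta(x)$), so training through it requires a surrogate gradient. The sketch below illustrates this idea in PyTorch, using a Gaussian bump to approximate the impulse in the backward pass; the surrogate shape and width are illustrative assumptions, not necessarily the exact approximation used in Bi-ColBERT [40].

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """sign(.) in the forward pass, smooth surrogate gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Gaussian approximation of the unit impulse (the width sigma is a guess).
        sigma = 0.4
        impulse = torch.exp(-x.pow(2) / (2 * sigma ** 2)) / (sigma * (2 * torch.pi) ** 0.5)
        return grad_out * impulse

def binarize(x: torch.Tensor) -> torch.Tensor:
    return BinarizeSTE.apply(x)

# Gradients now flow through the surrogate instead of the zero-a.e. true derivative.
e = torch.randn(4, 8, requires_grad=True)
binarize(e).sum().backward()
print(e.grad.abs().mean())
```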
5.11.1 Semantic Diffusion
Binarization with $\mathrm{sign}(\cdot)$ inevitably smooths the embedding informativeness into the binarized space $\{-1, 1\}^d$, regardless of the original values. Intuitively, one therefore wants to avoid condensing the informative latent semantics into (relatively small) sub-structures of the embedding bags. In other words, the aim is to diffuse the embedded semantics across all embedding dimensions, as one effective strategy to hedge the inevitable information loss caused by numerical binarization and to retain as much of the semantic uniqueness after binarization as possible.
Recall that in the singular value decomposition (SVD), the singular values and singular vectors reconstruct the original matrix; large singular values are normally interpreted as being associated with the major semantic structures of the matrix [242]. To achieve semantic diffusion by normalizing the singular values, thereby equalizing their respective contributions to the latent semantics, the authors introduced the following lightweight semantic diffusion technique.
Concretely, let $I$ denote the identity matrix and let $p^{(0)} \in \mathbb{R}^d$ be a standard normal random vector. During training, the diffusion vector is iteratively updated as $p^{(h)} = E_q^T E_q \, p^{(h-1)}$. Then, the projection matrix $P_q$ is obtained via:
$$P_q = \frac{p^{(h)} {p^{(h)}}^T}{\|p^{(h)}\|_2^2}. \tag{5.45}$$
The semantic-diffused embedding is then computed with a hyper-parameter $\epsilon \in (0, 1)$ as:
$$\hat{E}_q = E_q (I - \epsilon P_q). \tag{5.46}$$
Compared to the unprocessed embedding bag $E_q$, the diffused embedding $\hat{E}_q$ presents a more balanced spectrum (distribution of singular values) in expectation.
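The full diffusion step is easy to sketch. The code below follows Eqs. (5.45)-(5.46) with a fixed number of power-iteration updates; the renormalization of $p^{(h)}$ is an added numerical-stability choice (the projection in Eq. (5.45) is scale-invariant, so it does not change $P_q$), and the iteration count and $\epsilon$ are illustrative.

```python
import torch

def semantic_diffusion(E_q: torch.Tensor, eps: float = 0.5, n_iter: int = 3) -> torch.Tensor:
    """Semantic diffusion of an embedding bag E_q of shape (tokens, dim)."""
    d = E_q.shape[1]
    p = torch.randn(d, 1)                    # p^(0): standard normal random vector
    for _ in range(n_iter):
        p = E_q.T @ (E_q @ p)                # p^(h) = E_q^T E_q p^(h-1)
        p = p / p.norm()                     # renormalize for numerical stability
    P_q = (p @ p.T) / p.pow(2).sum()         # Eq. (5.45): rank-1 projection matrix
    return E_q @ (torch.eye(d) - eps * P_q)  # Eq. (5.46): damp the dominant direction

# The iteration drives p toward the top right singular vector of E_q, so the
# diffusion shrinks the largest singular value by roughly a factor of (1 - eps),
# flattening the spectrum.
E_q = torch.randn(32, 128)
E_hat = semantic_diffusion(E_q)
print(torch.linalg.svdvals(E_q)[0], torch.linalg.svdvals(E_hat)[0])
```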